The quantification of Simpson’s paradox and other contributions to contingency table theory

The analysis of contingency tables is a powerful statistical tool used in experiments with categorical variables. This study improves parts of the theory underlying the use of contingency tables. Specifically, the linkage disequilibrium parameter as a measure of two-way interactions applied to three-way tables makes it possible to quantify Simpson’s paradox by a simple formula. With tests on three-way interactions, there is only one that determines whether the partial interactions of all variables agree or whether there is at least one variable whose partial interactions disagree. To date, there has been no test available that determines whether the partial interactions of a certain variable agree or disagree, and the presented work closes this gap. This work reveals the relation of the multiplicative and the additive measure of a three-way interaction. Another contribution addresses the question of which cells in a contingency table are fixed when the first- and second-order marginal totals are given. The proposed procedure not only detects fixed zero counts but also fixed positive counts. This impacts the determination of the degrees of freedom. Furthermore, limitations of methods that simulate contingency tables with given pairwise associations are addressed.


Introduction
Categorical variables are observed in many branches of science. Contingency table theory serves to infer such data. A great spectrum of analytical methods was presented by Agresti [1]. In the present paper, some parts of the theory are improved and some methods are added.
In their historical overview, Fienberg and Rinaldo [2] recognized Bartlett's [3] important contribution to the theory of contingency tables. Simpson [4] clarified some remaining questions from Bartlett's'paper on the three-way interaction in a 2×2×2 table. In addition to theoretical results, Simpson gave an example in which the health benefits of a drug appeared separately in both males and females. However, if the data were merged, no effect was seen. Furthermore, Blyth [5] showed that the merged data might even indicate a strong negative effect of the drug. This phenomenon was called "Simpson's paradox" . Several examples have been found in real life, demonstrating the principle's great practical relevance and the many different situations in which it may arise. Many studies have investigated how to circumvent this paradox, how best to deal with it, or how to interpret it (e.g., [6][7][8][9][10][11][12][13][14]). However, no short and elucidating presentation has so far succeeded in showing the relation of the paradox and the inner structure of the table. Different measures are used for the association between two categorical variables, particularly for a 2×2 table (odds ratio, Yule's Q, Pearson's φ and ρ). Quantitative genetics, for example, uses the so-called linkage disequilibrium (LD). The application of LD to the two-way and partial associations delivers a closed formula quantifying Simpson's paradox. The formula is derived in Section 2 and applied in Section 7.1, and it allows a clear, correct, and straightforward interpretation of a famous Berkeley data set.
In a strong sense, Bartlett [3] did not investigate a "three"-way interaction but a "third"-way interaction for a 2×2×2 table. He considered the question of whether a third variable (sex) has an effect on the association between the other two variables (success and treatment). He suggested comparing the odds ratios of the partial 2×2 tables (one for males and one for females). When they agree, the third variable has no effect.
Simpson [4] realized that Bartlett's definition of no three-(or third-) way interaction implies a symmetry property: when the third variable has no effect on the interaction between variables one and two (agreeing odds ratios of both sub-tables), then automatically, the first variable has no effect on the interaction between variables two and three, and the second variable has no effect on the interaction between variables one and three. Therefore, Bartlett's [3] test on "no three-way interaction" is a global one, and the alternative hypothesis would be "there is at least one variable with three-way interaction". Although such a test is not senseless at all, it is hard to believe that someone is interested in whether the interaction between treatment and sex for the group of successful patients equals the interaction between treatment and sex for the group of failed patients. Therefore, a test for a single variable ("sex has no influence on the effect of a drug" versus "the effect of the drug differs between males and females") is still needed. It is clear that, for such a test, the odds ratio is not a suitable measure. A measure of association is needed that does not have the symmetry property. Simpson [4] mentioned that symmetry is lost for the root mean square contingency parameter, what we now call the correlation coefficient. However, he did not investigate this measure. It appears that this important issue has not been treated elsewhere so far, possibly because it does not fit the hierarchical log-linear model approach. In Section 3, this gap in the theory is closed. The method is applied to the Berkeley data in Section 7.2.
In quantitative genetics, the concept of LD has been generalized to three and four variables as the so-called three-or four-locus LD [15][16][17][18][19][20]. The three-locus LD of Bennett [15] is an additive measure and related to the additive measure of Lancaster [21,22]. It was shown [23,24] that this measure is not consistent with Bartlett's criterion, which is actually the solution of a cubic equation.
Bartlett's criterion, although appearing intuitive, turned out to agree with the maximumlikelihood equation of the log-linear model. Streitberg [25,26] discussed the shortcomings of the log-linear model, treated the tables as multinomial distributions, and argued for additive measures. Obviously (and unfortunately), he was not aware of the investigations into tables and entropy performed by Good [27].
Shannon's [28] principle of entropy is a successful concept in physics, engineering, information theory, and statistics. Khinchin [29] delivered mathematical foundations for this principle. In particular, he investigated a measure H for the information content of an experiment (with a finite size n of possible events) as a functional of the probability function. The higher the value of H, the lower the information content of the experiment. He made two assumptions: (i) H is largest when the events have unique probabilities 1/n and (ii) if an experiment consists of two experimental parts, A and B, then the information content of the whole experiment, H(AB), should be the sum of the information content of the first part H(A) and the information content of the second part, given the first part, denoted by H(B|A), i.e.,

HðA BÞ ¼ HðAÞ þ HðBjAÞ: ð1Þ
He showed that, under these reasonable assumptions, there is only one measure that is continuous: the entropy H ¼ À l P n i¼1 p i ln p i , where λ is a positive constant and often set to one, i.e., Good [27] treated contingency tables as multinomial distributions and determined the distribution with maximum entropy and given restraints, such as one-and two-way marginals. It turned out that his solution for 2×2×2 tables agreed with Bartlett's criterion.
There is another point speaking against Bennett's linear measure. In genetic multi-locus linkage analyses, Hill [23] showed that a table with an absence of three-way interactions may have negative "probabilities". That is, given a table with a three-way interaction, the corresponding hypothetical table without a three-way interaction would not exist. Such a dilemma cannot arise by applying the entropy principle because of its concavity.
It can be concluded that, for 2×2×2 tables, the multiplicative measure has a deeper impact than the additive one. On the other hand, the additive measure is much more tractable. Therefore, we ask which additive measure comes nearest (is most similar) to the multiplicative one. Section 4 examines whether Bennett's measure is the first-order Taylor expansion of Bartlett's measure.
A central theme in the progress of contingency table theory is the introduction and development of the log-linear model. In their historical overview, Fienberg and Rinaldo [2] show that a special point was the difficulty in handling zero counts. The nonexistence of the maximum-likelihood estimator (MLE) was indicated by the lack of convergence of the algorithms used to compute the MLE. Later, Fienberg and Rinaldo [30,31] generated a numerical procedure specifically designed to check for the existence of the MLE. They based their approach on investigations of extended exponential families and the geometrical properties of log-linear models. Practically, the question about zero counts was whether the marginal totals enforce the cells to have zero counts. In such cases, the cell is fixed and this therefore also influences the degrees of freedom. So far, it has been overlooked that not only zero count cells but also positive count cells might be fixed. Section 5 presents an elementary algorithm that detects all fixed cells.
There are variables with categories that have an obvious order, and such variables are called ordinally scaled. [32][33][34][35] documented the progress and problems with simulating ordinally scaled variables with given pairwise Pearson's correlation coefficients. The techniques are modifications and adaptations of simulation techniques for multivariate normally distributed variables with a given correlation matrix. However, there are no procedures available that work for every admissible correlation matrix. Section 6 presents a simulation method that has no such theoretical limitations.
[35] handled the same task but with demanded pairwise associations measured with Goodman and Kruskal's γ. [36] generated a program for Lee's procedure. Although the authors did not mention it, the method is not suitable for simulating all admissible scenarios. These shortcomings are overcome in Section 6.
In Section 7, a real data set reflecting Simpson's paradox is analyzed with tools derived in Sections 2 and 3.
The paper concludes with a discussion of the issues. Special attention is given to the application of the entropy principle.

The quantification of Simpson's paradox
Let X and Y be two random categorical variables with I X and I Y categories, respectively. In an experiment, n objects are inspected to identify which categories of variables X and Y apply. The counts n i,j , i = 1,2,� � �,I X , j = 1,2,� � �,I Y , are written in a I X ×I Y contingency table. The probability that an object matches categories X i and Y j is p i,j = P(X = X i^Y = Y j ), and its estimate is n i,j /n. The association between categories X i and Y j is defined by the linkage disequilibrium (LD) measures: The point indicates summation over the assigned variable, e.g., It is easy to check that D X i ;Y j ¼ p i;j p � i; � j À p i; � j p � i;j holds. Pearson's correlation coefficient can then be written as ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi which coincides with Pearson's φ.
With Z being a third categorical variable, the cell probabilities of the associated I X ×I Y ×I Z table are now p i,j,k = P(X = X i^Y = Y j^Z = Z k ), k = 1,2,� � �,I Z . Eq (3) then change to Using the definition of conditional probabilities, p i,j|k = p i,j,k /p •,•,k , the conditional analogue to Eq (3) is The result is formulated as a theorem. THEOREM: For an I X ×I Y ×I Z table, the difference between the two-way LD and the weighted sum of the partial LDs is For a 2×2×2 table, the difference becomes The simplification for the 2×2×2 table follows from inserting I Z = 2 into Eq (8) and regarding the well-known formula D A;

Testing the equality of partial interactions for one variable
The null hypothesis for Bartlett's test concerning 2×2×2 tables is the agreement of all partial interactions (measured as the odds ratio), while the alternative hypothesis is that at least one pair of partial interactions is unequal. Here, the effect (if any) of the third variable on the interaction between the first and the second variables is inferred. Let the first variable be the outcome of the experiment with categories "success" and "no success", the second variable be the applied treatment with categories "1" and "2" (one treatment could be a placebo), and the third variable be the sex of the patient with categories "male" and "female".
The null hypothesis is that both partial interactions of the third variable coincide. The hypothetical table with agreeing partial interactions and the observed table have several parameters in common: three two-way marginal totals, three one-way marginal totals, and the sample size (zero-way marginal total). The equational system for the probabilities is then : ð11Þ With eight cells and seven conditions, there is one free parameter, p 1,1,1 . The partial tables for male and female patients are presented in Table 1.
There is certainly no effect due to sex if both sub-tables agree. Solving this system of linear equations gives p 1,2 = p 1 p 2 , p 1,3 = p 1 p 3 and p 2,3 = p 2 p 3 ; i.e., all variables were pairwise independent.
If the odds ratios of both sub-tables agree, this would apply also for the sub-tables of the other variables, as acknowledged by Simpson [4]. Hence, it would not be a specific property of sex.
Thinking about LD, which is a relative measure (since the maximum and minimum depend on the one-way marginals), and the correlation coefficient, the better measure for associations in agreement will be the correlation coefficient.
Bennett [15] introduced an additive measure of the three-way association: In the introduction, it was concluded that the multiplicative measure (15) has more impact and the linear measure (16) could be an approximation. Therefore, it can be checked whether the linear measure could be the first-order Taylor expansion of the nonlinear one.
Substituting the LD expressions p i,j = D i,j +p i p j for the two-way probabilities of Eq (11) leads to pðp 1;1;1 Þ ¼ of zero and one-way probabilities of one half were carried out. The real solution was p ðTaylorÞ The appropriate measure for the three-way interaction would be D ðTaylorÞ ¼ p 1;1;1 À p ðTaylorÞ 1;1;1 . It can be seen that Bennett's measure differs from this but covers the four simplest terms. Therefore, Bennett's criterion can be interpreted as a simplified version of the first-order Taylor expansion of Bartlett's criterion. The second-order expansion was also available. The only unexpected result was that one coefficient was not a power of 2: the largest term was À 29 � 2 12 ð1 À 2p 1 Þð1 À 2p 2 Þð1 À 2p 3 ÞD 2 1;2 D 2 1;3 D 2 2;3 . The other coefficients were the following (excluding linear terms): 2 12 (three terms), 2 9 (three terms), 2 8 (three terms), and 2 5 (six terms).

The application of linear programming
Assume an observed I 1 ×I 2 ×� � �×I c contingency table n, i.e., there are c categorical variables, with I i categories per variable i, i2(1,2,� � �,c}. The contingency table n is characterized by the counts n i 1 ;i 2 ;���;i c , with n ¼ n �;:::;� ¼ P i 1 ;i 2 ;���;i c n i 1 ;i 2 ;���;i c counts overall. The one-way marginal totals n i k i ¼ n �;...;�;k i ;�;...;� can be written as where k i , 1�k i �I i , is a category of variable i. Analogously, the two-way marginal totals n i k i ;j k j ¼ n �;::;�;k i ;�;::;�;k j ;�;::;� can be written as where i and j define the involved variables and k i and k j define the appropriate categories. The marginal totals with indices I i , i2{1,2,� � �,c} can be determined by the others: Let m be the vector of the considered marginal totals. Then, m can be written as m ¼ ðn; n 1;�;...;� ; . . . ; n I 1 À 1;�;...;� ; n �;...;�;1 ; . . . ; n �;...;�;I c À 1 ; n 1;1;�;...;� ; . . . ; n �;...;�;I cÀ 1 À 1;I c À 1 Þ 0 : We order the cells of n into a vector n ! , with and thus establish a one-to-one relation between n ! j and n i 1 ;i 2 ;...;i cÀ 1 ;i c , with If j is given, the corresponding c-tuple i 1 , i 2 ,. . .,i c−1 , i c must be evaluated sequentially. The length of n ! is d ¼ Q c k¼1 I k . Then, the restraints can be formulated as where matrix A has entries zero or one and ensures the addition of the demanded components of n ! . For a 2×2×2 table, matrix A can be seen in Eq (10). In the first row of A, there are only ones, ensuring the addition of all components of n ! . In the second row, there is a one if the corresponding component of n ! has category one at the first variable, etc.
We now introduce a tableñ with the same dimensions as the observed table n. As with Eq (23), the unknown cell countsñ i 1 ;i 2 ;���;i c are ordered into a vector n , with Since we are looking for a tableñ that has the same zero-, one-, and two-way marginals as the observed table n, must be valid, where matrix A is the same as before. Then, the set S with where n � 0 is meant componentwise and ensures nonnegativity, contains all admissible tables satisfying the constraints.
There is at least one element of S: the observed table n. Assuming that there are two elements in S, n 1 and n 2 , then the linear combination λ n 1 +(1−λ) n 2 with 0�λ�1 is also admissible. Hence, the set of admissible tables is convex and, furthermore, the theory of linear optimization is applicable. In particular, there exist unique solutions (e.g., see [37]) for the linear optimization problems leads to the set, say O, of the components for which the equality is valid. Several numerical software packages contain a linear programming or optimization procedure. As an example, the 4×4×4 data from Table 6 of Fienberg and Ricardo [2] was analyzed, with the results presented in Table 2. There were 24 zero and 12 nonzero counts, which turned out to be fixed. Table 2 provides more information (such asñ 1;4;2 ¼ 6 Àñ 1;1;2 Þ than can be obtained from the described algorithm. We will come to that now.

The application of algebraic software
Applying algebraic software such as Mathematica [38] to problems (28A) and (28B) has an advantage compared to pure numerical algorithms. The system of equations m ¼ A n can be solved using the procedure "Solve", thus decreasing the number of variables. For a 2×2×2 table, the solution of Eq (24) is Here, the representation x = x h +x s is used [37], where x h is the solution of the homogenous system, i.e., 0 = A x h , and x s is a special solution, i.e., m = A x s . Eq (29) is just Eq (11) times n. From the eight initial variables of the table, there is only one left:ñ 1;1;1 .
In the general case of an I 1 ×I 2 ×� � �×I c table, there are d ¼ Q c i¼1 I i cells or initial variables. In matrix A of Eq (24), there are P c i¼1 ðI i À 1Þ linear independent rows for the one-way totals, P cÀ 1 i¼1 fðI i À 1Þ P c j¼iþ1 ðI j À 1Þg linear independent rows for the two-way totals, and one row for the overall total n. The number, r, of linear independent rows is therefore Table 2. Cell counts for given zero-, one-, and two-way marginal totals for Table 6 of Fienberg and Rinaldo [2]. (The original table is obtained by insertingñ 1;1;2 ¼ 4 andñ 1;1;3 ¼ñ 2;3;1 ¼ñ 3;2;1 ¼ 1).
When O is not empty, the analysis can be refined. In Eq (26), the fixed counts substitute for the variables of the fixed cells. This means that vector n in m ¼ A n has to be renewed by set- Analogously, the list of variables is reduced by canceling the variables of the fixed cells. The solution of the new system of equations then further reduces the number of free variables. Applying this to Table 6 of Fienberg and Rinaldo [2] yielded Table 2 of this study. Only four free variables are left over. In this way, a very compact expression for admissible tables was reached.

Association measured by Pearson's ρ
The aim is to simulate an I 1 ×I 2 ×� � �×I c contingency table of c numerical variables with given one-way marginals and Pearson's correlation coefficients ρ i,j , i,j2{1,2,� � �,c}. The categories of the variables are characterized by numbers. These numbers, v i ði k i Þ, i2{1,2,� � �,c}, k i 2{1,. . .,I i }, may agree with the index of the category, i.e., v i (i 1 The contingency table is characterized by the probabilities p k 1 ;k 2 ;...;k c , where k i 2{1,. . .,I i } defines the category of variable i. The probabilities are unknown at present. Now the equations are collected to ensure the validity of the given conditions.
The one-way probabilities, p i k i , are assumed to be known. Here, i is the number of the variable and k i is the category of that variable. As before, p i k i may be obtained from the c-way probabilities, p k 1 ;k 2 ;...;k c , by summing over all indices except index i, i.e., Hence, the expectations m i ¼ of the variables i2{1,. . .,c} are also known. The pairwise correlation coefficients, ρ i,j , are defined by where p i k i ;j k j are the two-way probabilities of variables i and j, which can be determined via ...;�;j k j ;�;...;� . Hence, must hold. The left-hand side of this equation only involves constants; i.e., the left-hand side is a constant. The right-hand side of Eq (37) is a linear combination of the cell probabilities; therefore, the theory of linear programming can be applied. For convenience, Mathematica and the principles of Section 5.2 are used to proceed. First, the system W of the linear equations (1 ¼ p �;...;� ; P c i¼1 ðI i À 1Þ ¼ I � À c equations for the oneway marginals and (c−1)c/2 equations for the two-way correlations) is solved.
Then, the lower and upper bounds for the first free variable are determined using the procedures Minimize and Maximize of Mathematica. Three cases need to be considered. (1) The procedure finds no solution, in which case there is no table satisfying the demanded correlations (in the literature, there was no practical and sufficient criterion for the existence of a table). (2) The lower bound and upper bound agree. Then, the first free variable is fixed. (3) The lower and upper bound differ, so there are a variety of tables satisfying the demanded correlations. Therefore, it must be decided whether an average table or an extreme one is preferred. We suggest simulating at least an average table and possibly afterward simulating extreme tables. For the average table, we assign the mean of the bounds to the first free variable. For extreme tables, either the lower or the upper bound can be assigned to the first free variable.
In either case, the first free variable is now assigned to a constant value, and the system of equations W is updated by inserting that value for the variable. Then, the lower and upper bounds for the new first free variable are determined. (The output "no solution" may no longer appear.) This algorithm is repeated until there are no free variables left and all cell probabilities are determined. Now, we have all d ¼ Q c k¼1 I k cell probabilities. We interpret them to define a d-point distribution. Then, the inversion algorithm of Lee [35] can be used to simulate the table.

Association measured by Goodman and Kruskal's γ
This section is organized as follows. First, the algorithm of Lee [35] is described, including three improvements. Then, we use it in two examples showing scenarios of association parameters where a table satisfying the demands does not exist and, even when such a table exists, it cannot be determined with Lee's [35] method. Later, hints are provided for how to handle these problems.
Consider two ordinally scaled categorical variables Y 1 and Y 2 with I 1 and I 2 categories, respectively. Let the (unknown) joint probabilities be denoted by p i,j = P(Y 1 = i^Y 2 = j). Consider two random objects with observations of both variables, . The probability that the first object has categories i and j and the second object has categories i 0 and j 0 is then p i,j p i 0 ,j 0 . In addition to being objects with observations, O 1 and O 2 are also two points of the I 1 ×I 2 table, and they may be concordant (i<i 0 and j<j 0 or i>i 0 and j>j 0 ), discordant (i<i 0 and j>j 0 or i>i 0 and j<j 0 ), or indifferent (at least one equality sign appears). Adding the concordant and the discordant cases, the definition of γ becomes Note that this definition deviates from that of Lee [35] ones. (This is the first improvement by the author.) In the version from [35] or [40] the right-hand-side double sums do not appear. In that case, however, we can obtain differing association values if we rename or interchange the variables. Since this is not judicious in the actual context, the symmetrical version (38) is applied. However, this does not affect the ideas of Lee [35] in an essential way.
Following Lee [35], for given one-way marginals, i.e., for given p ¼ ðp 1;� ; p 2;� ; . . . ; p I 1 ;� Þ0 and q ¼ ðp �;1 ; p �;2 ; . . . ; p �;I 2 Þ0, the maximum gamma is γ = 1. The probabilities p i,j carrying this property can be determined by the following routine. With an outer loop i = 1,2,� � �,I 1 and for each i with an inner loop j = 1,2,� � �,I 2 (or vice versa), set (This is the second improvement by the author. The author thought that Lee [35] meant the same, but his version was hard to understand.) For negative gammas, the method must be modified. [35] and [36] stated that a two-way table with perfect negative association (i.e., γ = −1) can be obtained from the two-way table with perfect positive association (i.e., γ = 1) by reversing the order of categories for one of the variables. To see that this is not correct, consider a table with three variables where all association parameters are γ = 1. Reversing the order of categories of the first variable changes two associations to γ = −1. If one then reverses the order of categories of the second or third variable, there remain two associations with γ = −1 and one with γ = 1. If we then reverse the order of categories of the remaining variable that has not changed so far, we again have three associations of γ = 1. Therefore, a table with three variables, where all association parameters are γ = −1, cannot be generated.
However, the joint probabilities for γ = −1 can be determined by reversing the components of the one-way marginal totals p, i.e., p ¼ ðp I 1 ;� ; p I 1 À 1;� ; . . . ; p 1;� Þ0, applying routine (39), and next reversing the rows of matrix fp i;j g I 1 �I 2 , thus obtaining the table p min with the originally demanded one-way marginal totals. (This was the third improvement by the author.) Denote the generated I 1 ×I 2 table with p opt and the I 1 ×I 2 table for independent Y 1 and Y 2 with p 0 , i.e., p 0 i;j ¼ p i;� p �;j . Then, the convex linear combination p(λ) = (1−λ)p 0 +λ p opt , 0�λ�1, defines a table p(λ) satisfying the one-way marginal totals. For λ = 0, p(0) = p 0 holds and the appropriate gamma is zero, i.e., γ[p(0)] = 0. Also, for λ = 1, p(1) = p opt holds and the appropriate gamma is one, i.e., γ[p (1) where Γ is the nominal amount of association. Therefore, Lee [35] solves numerically the equation The main aim is to generate an I 1 ×I 2 ×� � �×I c table for c categorical variables with given oneway marginal totals and nominal pairwise associations (40) is applied, leading to c(c−1)/2 two-way marginal totals p G i;j . Each entry p G i;j i0;j0 , with i 0 2{1,2,� � �,I i } and j 0 2{1,2,� � �,I j }, can be expressed as a sum of the c-way probabilities, thus exhibiting linear equations. Lee [35] acknowledged that a solution of a system of linear equations with additional inequalities, p i �0, can be found by applying linear programming. Having determined an admissible table, the simulation is carried out with the inversion algorithm.
The described method of determining an admissible table will be called the γ−method from here on.
Neither Lee [35] nor Ibrahim and Suliadi [36] mentioned any problems finding a solution and gave the impression that the procedure always finds one. An example is given to prove that this is not always the case.
It is not necessary to solve (40), since a priori, λ = 1 holds. Now, the zero-, one-, and two-way marginal totals are known, and the system of linear equations can be established. There is one free parameter, and the three-way table satisfying the restraints is presented in Table 3.
One could think that, if an admissible table exists, it can be determined by the γ-method. We now show that this is not correct. As an example, our task is to generate a 3×3×3 table with
However, there is a table satisfying the conditions that was found with the procedure NMaximize from Mathematica. The variable Γ was maximized under the restraints of the zero-and one-way marginal totals Γ = −Γ 1,2 = Γ 1,3 = Γ 2,3 and nonnegative variables. The maximum was Γ = 0.6023, and the obtained table is given in Table 7.
To simplify the check of the side conditions, the sub-tables are given in Table 8. Although there are similarities to the two-way sub-tables in Table 4, there is one specific difference: zeros do appear, supporting an extreme table.

PLOS ONE
The quantification of Simpson's paradox and other contributions to contingency table theory The tool to determine an admissible association parameter scenario is still applied to the 2×2×2 table from above. Since there was no solution for −Γ 1,2 = Γ 1,3 = Γ 2,3 = 1, it would be interesting to find an extreme constellation for which a solution would exist. The term −Γ 1,2 +Γ 1,3 +Γ 2,3 was maximized under the restraints of the one-way marginal totals and nonnegative variables using the procedure NMaximize from Mathematica. The maximum was −Γ 1,2 +Γ 1,3 +Γ 2,3 = 2.714 and the obtained table is given in Table 3B. The determination of the sub-tables and the comparison with the sub-tables of Table 2 shows that the first two sub-tables agree, but the third differs.

Association measured by Somers' d
There is one similarity between Pearson's ρ and Goodman and Kruskal's γ: both take values between −1 and 1. A difference is that ρ = 1 means determinism, i.e., the observation of the category of one variable of an object is sufficient to know the category of the second variable. This is not generally true for γ = ±1, since such an event only indicates that the table with maximum or minimum association is present. In fact, Lee [35] called it misleading perfect (negative) association. The three 2×2 tables shown here for γ = 1, 0, -1 have the same one-way marginals, p 1 = (0.8, 0.2) 0 and p 2 = (0.1, 0.9) 0 .

PLOS ONE
The respective Pearson's correlation coefficients are ρ = 0.1667, ρ = 0, and ρ = −0.6667. That means, for the given one-way marginals, that γ = 1 stands for low (positive) association, while γ = −1 stands for large (negative) association. Hence, the γ-scale is a relative one and worthless without additional information. Following [40], Somers' d is a better measure of association (dependence) between ordinal variables. It is a modification of Goodman and Kruskal's γ. Since a symmetrical version is needed here, the definition of T becomes and the definition of d is The right-hand-side version results from 1 ¼ P It is easy to see that d = 0 holds if the variables are independent, and d = ±1 holds if the category of one variable can be deduced from knowing the category of the other variable, i.e., when the table has a (anti-) diagonal structure. For the 2×2 tables from above, d = 0.087, d = 0, and d = −0.471 hold, respectively.
To give an impression of the relation of Somers' d and Pearson's ρ, the parameters were calculated for the sub-tables of Table 8 To find an admissible table satisfying the one-way marginals and the nominal pairwise association parameters Δ i,j , it is possible to apply a slightly modified version of the γ−method. For a certain pair i,j of variables, p 0 and p opt (which are p max for Δ>0 and p min for Δ<0) are determined as before. It is useful to calculate d for the table p opt . The nominal Δ should reflect less Table 7. 3×3×3 table satisfying must be solved numerically. As with the γ-method, the obtained two-way marginals p(λ � ) are written as a system of linear equations. These are solved by linear programming software. If no solution is obtained, the nominal association parameters need to be weakened. This was the analog to the γ-method. The additional tools presented in Sections 6.1 and 6.2 can be adapted. Consider

How to obtain tables for all admissible associations measured by Somers' d
As was worked out in the last section, the adapted γ-method does not always allow the determination of a table satisfying nominal associations measured by Somers' d, although one exists. It was also reported that a numerical maximization was able to find the solution. However, a large number of iterations were necessary, and the method may fail if the number of variables increases.
Assume that a table p � exists that satisfies the nominal two-way associations measured by Somers' d. Let the expression d � = d(p � ) define this property, where d � is the vector of the nominal two-way associations and d(p) indicates the vector of the actual two-way associations from table p.
From Section 6.1, it is known how to determine a table with given Pearson's correlation coefficients ρ, where ρ is the vector of the two-way correlation coefficients. Denote the generated table with p(ρ). We are looking now for ρ, so that p(ρ) has the desired property with respect to d � , i.e., d � = d[p(ρ)]. This is realized by a minimization procedure: We used the function FindMinimum of Mathematica with starting points ρ = d � and the Euclidean norm. The function p(ρ) had to be specified in two ways, and subsequently, the minima and maxima of the free variables were evaluated. The cell of the actual variable was set to the mean of the minimum and maximum. We denote the specification with p(ρ, mean). When ρ left the admissible region, i.e., when there was no solution for the restraints, the penalty term kd � −d[p(ρ, mean)]k was set to a large value. The argument for which the norm is minimum is named ρ � .
Recall that a solution need not be unique. Assuming independence between all pairs of variables, the related table is determined by multiplying the one-way marginals involved in the specified cells. For this independence table, d = (0, 0, 0) 0 and ρ = (0, 0, 0) 0 hold. If we wish to generate the independence table, given the demand d � = (0, 0, 0) 0 , we must give up the choice of the mean value of the admissible intervals (b l , b u ) of the free variables. Instead, we take that value of the interval that is nearest to the value p ind of the independence table. If b l �p ind �b u holds, p = p ind is taken, and if p ind <b l holds, p = b l is taken. For b u <p ind , the choice is p = b u . We denote this specification with p(ρ, ind). The application of this principle led indeed to ρ � = Obviously, for high associations, the difference between p(ρ, ind) and p(ρ, mean) was not great. For the most extreme associations, −0.308 and 0.594, p(ρ, ind) and p(ρ, mean) result in the same table.
When the minimum of (45) was not zero for a nominal d � , no table for the demands was found. Then, the nearest admissible table due to the used norm was obtained.

Why the two-way LD differs from the partial LDs and their mean
One real-life example for Simpson's paradox is particularly impressive. The University of California, Berkeley, was sued for bias against women who had applied for admission. The reduced data version found at https://en.wikipedia.org/wiki/Simpson%27s_paradox is presented in columns 1-5 of Table 9. Table 9. Numbers of denied and admitted applications at six departments as part of the study [41]. Variable 1 is sex (men-women), variable 2 is admittance (denied-admitted), and variable 3 is the department (1 to 6). D 1;3 i is the LD between variable 1 and variable 3 (which is now dichotomous: Department i versus the rest). D 2;3 i is the LD between variable 2 and variable 3 (which is again dichotomous: Department i versus the rest). Parameter p 3 i stands for the frequency of applications to department i. D 1;2j3 i is the LD between the first category of variable 1 and the first category of variable 2 within Department i, and r 1;2j3 i is the corresponding correlation coefficient.

PLOS ONE
Dividing the number of admitted men by the number of applying men shows a rate of 1198/2691 = 44.5%, while dividing the number of admitted women by the number of applying women shows a rate of 557/1835 = 30.4%. The large difference between 44.5% and 30.4% resulted in a perception of discrimination against women. Therefore, the question was whether women were really handicapped or if there were other reasons that led to the differing rates.
Bickel and collegues [41] examined the department-level data and did not find clear evidence of discrimination against women. Averaged over the departments, they found a moderate preference for women. In principle and qualitatively, this corresponds to the inspection of the last two columns of Table 9. Note that positive values mean that more men than women relative to their frequencies were denied; i.e., more women were admitted. The authors of [41] also worked out the reason for the great discrepancy between the apparent overall handicap for women and the almost absent handicaps within the departments. The reason was the preferred applications of women to departments with low admission rates. However, this reason was not found by straightforward theory but by good detective work.
With the LD approach of Section 2, the parameter for the overall association between sex and admission is D 1,2 . The overall LD is D 1,2 = (1493×557−1198×1278)/4526 2 = −0.0341, showing the handicap for women. (Significance tests should and can be applied, but they are not the focus here.) This approach does not account for the influence of the departments. Therefore, the averaged LDs of the departments D 1;2j3 i is a more reliable parameter. Direct evaluation of � D 1;2j3 via the eighth column of Table 9 gives � D 1;2j3 ¼ 0:0034, reflecting a small preference for women. The difference between both parameters is D 1;2 À � D 1;2j3 ¼ À 0:0375. Now, the result (8) of Section 2 is applied. With it, the difference can be determined in a completely different manner. The difference between D 1,2 and � D 1;2j3 is P 6 i¼1 D 1;3 i D 2;3 i =p 3 i . The first summand, D 1;3 1 D 2;3 1 =p 3 1 , is determined as follows. D 1;3 1 belongs to the 2×2 table for Department 1, where the first column is assigned to "men" and the second column to "women". The first row is assigned to "Department i". The second row is assigned to the complement, i.e., to the rest of departments. In Department 1, 825 men and 108 woman applied. Overall, 2691 men and 1835 women applied; i.e., 2691−825 men and 1835−108 women applied to the other departments. The numbers are presented in Table 10 together with those for denied and admitted applicants at department 1.
The interpretation is that the apparent discrimination against women was caused by a property of the departments. Those with high admittance rates had more male applicants and those with low admittance rates had more female applicants. While Bickel and colleagues [41] had to be good detectives to discover this trend, the new approach makes it obvious immediately. The remark in [41] concerning the role of the size of the departments has to be verified, since p 3 i appears in the denominator in (46). See also Eq (49B).

The determination of parsimonious models fitting the data
The aim is to infer whether a proven three-way interaction is caused by all three-way interaction parameters or only by a subset of them. With the Berkeley data, it is shown that the search for a parsimonious model fitting the data can be successful.
All three-way interaction parameters differing from zero reflect a contribution to three-way interaction. To quantify these contributions, the partial 2×2 tables under the hypothesis of an absence of three-way interactions were compared with the observed ones. The χ 2 values for the six categories were 20.6, 0.1, 2.4, 0.01, 2.1, and 0.1, respectively. Obviously, the first department indeed plays a dominant role.
A table {p i,j,k } fitting the observed table must therefore trim D 1,1,1 to zero. This can be guaranteed by setting the free parameter p 1,1,1 to n 1,1,1 /4526. Then, there remain four free parameters. Theoretically, one could now derive maximum-likelihood estimates to fit them, but the use of the maximum entropy principle under restraints is easier. The restraints are the zero-, one-, and two-way marginals and p 1,1,1 = 0.0692. Then, the four remaining free parameters are determined by numerically maximizing the entropy. The maximum H = 2.886 was reached for p 1,1,2 = 0.0454, p 1,1,3 = 0.0463, p 1,1,4 = 0.0605, and p 1,1,5 = 0.0314. A comparison of the corresponding table with the observed one yielded a χ 2 value of 2.56; i.e., the data were met. For the hypothetical table, the partial correlations r 1;2j3 i for Departments 1 to 6 were 0.136, −0.003, −0.007, −0.007, −0.006, and −0.004, respectively; i.e., with the exception of r 1;2j3 1 , they were absolutely small. The complete table, multiplied by n, is presented in Table 11 under Method A. Now an alternative method is investigated. In Section 3, we worked out for 2×2×2 tables how to infer the agreement of partial correlations. This approach is straightforward to generalize to 2×2×I 3 tables with I 3 >2. (For that, the right-hand-side expression of Eq (14) is useful. There, i substitutes for index "1" and I 3 substitutes for index "2". The same substitutions are necessary for the third indices of A and B.) Then, the global hypothesis for the Berkeley data would be that all partial correlations with respect to the third variable agree. It can be formulized by demanding r 1;2j3 1 ¼ r 1;2j3 2 ¼ � � � ¼ r 1;2j3 6 ; i.e., there are actually five equations. Since there are also five degrees of freedom, the hypothetical table may be calculated explicitly. The hypothetical agreeing partial correlation coefficients became 0.019. A comparison of the hypothetical 2×2×6 table with the observed data gave a χ 2 value of 17.4 (p = 0.0036); i.e., one cannot be convinced with unique correlation coefficients. Comparisons of the partial 2×2 tables with the observed ones showed one significant deviation. For Department 1, the χ 2 value became 19.2. All other χ 2 values were smaller than 2.1.
Comparisons of the partial 2×2 tables with the tables for independence showed no significant deviation. All χ 2 values were smaller than 0.4. That means 0 ¼ D 1;2j3 2 ¼ D 1;2j3 3 ¼ � � � ¼ D 1;2j3 6 can be assumed. These five equations are solved by the five arguments p 1,1,1 = 0.0683, p 1,1,2 = 0.0455, p 1,1,3 = 0.0466, p 1,1,4 = 0.0608, and p 1,1,5 = 0.0316. The comparison of the associated table with the observed data gave a χ 2 value of 3.69; i.e., the model fits the data. Comparisons of the partial 2×2 tables with the observed ones also showed no significant deviation. All χ 2 values were less than 1.2. Comparisons of the partial 2×2 tables with those under independence showed one significant deviation. For Department 1, the χ 2 value became 10.8 (p = 0.001, the correlation was r 1;2j3 1 ¼ 0:107, somewhat smaller than before). All other χ 2 values were of course zero. The table is presented under Method B in Table 11.
Method B gained from the finding that five partial correlations could be set to zero. Under different circumstances, it could be possible that the five partial correlations agree but are not zero. In that case, r 1;2j3 i ¼ r 1;2j3 iþ1 can be assumed for i2{2, 3, 4, 5}. For these four equations, four variables can be eliminated. The last variable, p 1,1,1 , is determined via maximum likelihood, i.e., by maximizing ∑ i,j,k n i,j,k ln(p i,j,k ). The solution can be viewed under Method C in Table 11. The comparison of the table with the observed data gave a χ 2 value of 2.73; i.e., the model fits the data. Comparisons of the partial 2×2 tables with the observed ones also showed no significant deviation. All χ 2 values were less than 0.72. Comparisons of the partial 2×2 tables with those under independence showed one significant deviation. For Department 1, the χ 2 value became 16.7 (p = 0.00004, the correlation was r 1;2j3 1 ¼ 0:134). All other χ 2 values were less than 0.03.
Thus, the apparent discrimination against women with respect to admittance turned out to be untrue. In Departments 2 to 6, men and women were admitted equally. In the first department, men had a significant handicap. Table 11. Fitted counts of the Berkeley data. Five free variables were fitted in three ways. A: First variable n 1,1,1 taken from Table 9, four from maximizing entropy. B: Five from D 1,1,i = 0, i = 2,3,. . .,6. C: Four from agreeing r 1;2j3 i ¼ r 1;2j3 iþ1 ; i ¼ 2; 3; 4; 5, with the fifth the log-likelihood estimate.

Dep.
Men

Discussion
In Section 2, the LD parameter was used to quantify Simpson's paradox. The difference between a two-way interaction and the averaged partial interactions was derived. For a 2×2×2 table, the difference was (Note that the notation D 1,2 , e.g., means the LD between the first category of variable 1 and the first category of variable 2, formerly denoted by D 1 1 ;2 1 .) In many experiments, there is one response variable (here it is the first one) and two explanatory variables. The latter ones can be arranged to ensure D 2,3 = 0, for example, by applying the treatments to the same fraction of males and females. In this way, the difference is zero and Simpson's paradox is circumvented.
Let D 1,3 and D 2,3 differ from zero. One could think that the difference is largest when p 3 is near zero or one. However, the value of LD depends on the one-way marginal totals. Using the correlation coefficients (5) instead gives ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi Thus, the absolute difference is largest when p 1 and p 2 are one half, and it is smallest when p 1 or p 2 are zero or one. On the other hand, when p 1 and p 2 are zero or one, the associated LD, i.e., D 1,2 , is zero. Therefore, it is useful to also consider the relative difference: ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi For an I 1 ×I 2 ×I 3 table, the appropriate expression is When we are interested in the association between two categorical variables, such as sex and admission at a university, it is useful to determine D 1,2 or ρ 1,2 . If one finds preference for one sex, this does not mean that the other sex experienced discrimination. The reason could be that the abilities of the sexes happened to be different. Therefore, it was reasonable to consider an index for the high school report as a factor. With the Berkeley data, it turned out that the departments need to be considered as a factor. When a third factor has an effect, then � D 1;2j3 gives a better estimate for the association of the two variables than D 1,2 . However, the value of D 1,2 is still useful. If the difference D 1;2 À � D 1;2j3 is greater than zero, it follows automatically from (47) that D 1,3 and D 2,3 cannot be zero and they have the same signs. If the difference is zero, it follows that D 1,3 or D 2,3 are zero. If the difference is smaller than zero, D 1,3 and D 2,3 cannot be zero and they have different signs; i.e., the interactions have different directions.
Hence, Eqs (8) and (9) are a great help for interpreting the tables. It is particularly interesting that their validity is independent of the free parameters.
Section 3 was dedicated to the question of whether the amount of interaction between two variables depends on another categorical variable. In Section 7.2, the approach was generalized to 2×2×I 3 tables and applied to the Berkeley data. It was possible to find parsimonious models that fit the data.
If all variables have more than two categories, i.e., for a general I 1 ×I 2 ×I 3 table, the third variable has no effect on the associations between the first and second variable if holds for i = 1,2,� � �,I 1 −1,j = 1,2,� � �,I 2 −1, and k = 1,2,� � �,I 3 −1. Analogously to (12), (13), and (14), this system of linear equations can be solved, thereby delivering the hypothetical table that can be compared with the observed one. If it does not fit the data, subsequently it can be checked whether the k−th category of the third variable plays a special role. For each k, the (I 1 −1)(I 2 −1) equations of (50) are solved. The remaining (I 1 −1)(I 2 −1)(I 3 −2) free variables are found by maximizing entropy. For each k, the hypothetical table can again be tested against the observed table.
When there is still no hit, the largest deviations from an average ρ can be searched in different ways. There is a need for further investigations into an optimal systematic strategy to find a parsimonious model. The model choice and multiple testing theories have to be kept in mind.
However, the new approach is suited to answering important questions and surely enriches the theory of contingency tables.
In Section 4, the relation between Bartlett's and Bennett's measure on the three-way interaction was investigated. As summarized in the introduction, Bartlett's measure (which he mentioned came from R.A. Fisher) had a high degree of impact, while Bennett's measure was a generalization of the two-way LD based on intuition. The meaning and correctness of this measure could therefore only be checked through its relation to Bartlett's measure. As it turned out, it is a simplified version of the first-order Taylor expansion of the latter one.
Unfortunately, the roots of the seven-degree polynomial arising for the multiplicative measure cannot be determined algebraically, and the Taylor expansion cannot be generated directly. However, the criterion is a function of the parameter p 1,1,1,1 , which is itself a function of the one-, two-, and three-way marginal totals. Thus, further progress depends on the availability of an effective algorithm to derive multivariate Taylor expansions for implicit functions. Since the implicit function is a polynomial (where the order of derivatives unequal to zero is finite) and there is a high amount of symmetry, there appears to be hope.
The focus in Section 5 was on tables with zero counts. The question was whether these counts appeared by chance or whether they were a necessary consequence from the given twoway marginal totals. The application of linear programming was successful in obtaining fixed zero counts. Furthermore, fixed nonzero counts can be determined.
One example of Fienberg and Rinaldo [2] was reanalyzed and lead to Table 2. For completeness, the other examples were also investigated. For their Tables 4 and 7, no fixed cells were obtained. For their Table 5, all cells turned out to be fixed. Fienberg and Rinaldo [2] characterized this table as yielding no MLE and wrote, "In fact, the values of both goodness of fit statistics will always be almost zero, no matter what the positive counts are". This underlines that they did not acknowledge that the contingency table was the only one with the given marginal totals. So far, there was no tool available to find this simple but important truth.
In cases where the number of variables could be reduced, such as with Table 2, the question arises whether the determination of the MLEs can be optimized. One way would be to modify commonly used procedures. Alternatively, Good's [27] method of maximizing the entropy under restraints can be used. Numerical maximization of the entropy of the data of Table 2 (divided by n = 113), given the two-way marginal totals, yielded p 1,1,2 = 0.0316, p 1,1,3 = 0.0120, p 2,3,1 = 0.0102, and p 3,2,1 = 0.0081. Due to the concavity of entropy, the convergence was excellent. The other cell counts can be calculated according to the expressions in Table 2.
In Section 6, improvements were achieved for the simulation of ordinally scaled variables. The main task was to determine an admissible table satisfying the demands. As noted above, there might be difficulties in obtaining an admissible table, simply because such a table does not exist when the restraints are too strong. Ignoring the assumptions p i �0, a solution would exist (if the number of equations expressing the restraints does not exceed the number of variables). However, the bona fide table would have negative entries. Therefore, one can search for an admissible solution by minimizing where the free variables must be fit. If the obtained minimum is zero, an admissible solution is found. Alternatively, the maximum entropy principle under restraints can be used. When a bona fide table has negative entries, the entropy becomes complex. Therefore, it is useful to minimize where Im(x) is the imaginary part of x. Our limited experiences would emphasize the latter method. Lee [35] derived an algorithm for simulating nominal variables with given pairwise correlations measured with Goodman and Kruskal's τ, 0�τ�1. Since the original measure does not ensure τ(Y 1 , Y 2 ) = τ(Y 2 , Y 1 ), Lee suggested the symmetric measure τ = Max[τ(Y 1 , Y 2 ), τ(Y 2 , Y 1 )]. This measure is one when τ(Y 1 , Y 2 ) or τ(Y 2 , Y 1 ) is one. However, a maximum correlation of one should only appear when both τ(Y 1 , Y 2 ) and τ(Y 2 , Y 1 ) showed a correlation of one. Therefore, it is more appropriate to define τ = [τ(Y 1 , Y 2 )+τ(Y 2 , Y 1 )]/2.
It can be shown that nominal variables result in similar problems as with ordinally scaled variables. Their treatment is analogous to that presented in Sections 6.2 and 6.3. Unfortunately, the method from Section 6.4 cannot be applied. The reason is that a measure for nominal variables is invariant with respect to permutations of the categories, while this does not apply to the correlation coefficient generally.
Although a lot of care was spent on the simulation of tables with nominal pairwise association measures, it seems that the meaning of such scenarios is limited. In practice, when an observed table is analysed, it is more important to simulate either tables under a null hypothesis or tables under different alternative hypotheses. In both cases, the two-way marginal totals can be viewed as fixed. With given two-way marginal totals, the two-way associations can be determined. (When there are c variables, even the c-way marginal totals can be viewed as fixed.) Then, it remains to define the properties of the table to simulate and to determine the cell properties. The simulation can then be carried out with the inversion method of Lee [35].
The statements resulting from such simulations are normally about the effect of certain properties of a table. This is only correct when there is just one table with the properties. Commonly, there are several such tables; i.e., it is necessary to simulate at least some extreme tables (where the cell probabilities are edges of the convex set of admissible tables) and an average table (e.g., the table with the given restraints and maximum entropy).
In this study, Good's [27] investigations on maximum entropy under restraints were repeatedly used. It allows us to determine hypothetical tables without knowing the MLEs of the log-linear model explicitly, as it suffices to formulate the equations of the hypotheses.
Note that the row for maximum entropy satisfies the theoretical results of Roy and Kastenbaum [39] on the MLE for the log-linear model with given zero-, one-, and two-way margins. The approach [43] yielded the correct result only in one case.
Due to the concavity of entropy, the maximum is a global one. This explains why Bartlett's criterion, which is a cubic equation, has one real and two complex solutions under all admissible circumstances.
This fact makes it easy to prove that Bartlett's and Bennett's criteria agree if p 1 = p 2 = p 3 = 1/ 2 holds or if at least two of the three two-way interactions are zero. When these conditions are substituted into (17), the criterion (15) with the appropriate cell frequencies can be expanded. The result is in both cases Good's [27] paper on maximum entropy under restraints, however, was sometimes overlooked. For example, Streitberg [25,26] did not include this approach in his contemplations. When Fienberg and Rinaldo [2] cited Good, they did not write about entropy, and when Fienberg and Rinaldo [30,31] wrote about entropy, they did not cite Good. The authors of [43] wrote about entropy without recognizing Good's results. They stated that an equivalence test for the independence between one variable and the remaining two in [39] was not correct. For a 2×2×2 table, the statement of [39] can be formulated as follows: Assuming D 1,3 = D 2,3 = 0, the validity of Bartlett's criterion, i.e., D = 0, is equivalent to p i,j,k = p •,•,k p i,j,• for i,j,k2{1,2}.
Leaving out the constants, the derivative of the likelihood function is The constraints investigated in this study, given one-, two-, or three-way marginal totals, lead to cells n i or p i , which are linear combinations of free parameters and given constants. Actually, each free parameter x appears in the cells either as x or-x; see Eqs (4), (11), (17) and Table 2. Therefore, dn i /dx is either 1 or -1. Thus, the derivative (59) is zero if is satisfied. The plus sign means that summation has to be taken over all cells where x appears as +x, and the minus sign means that summation has to be taken over all cells where x appears as-x. Both cases appear with the same frequency. Therefore, Eq (60) indicates that the agreeing harmonic means guarantee an optimum. It can be shown that the second derivative is strictly positive; i.e., solving Eq (60) gives the value x for which the likelihood is minimal. Two examples of 2×2 tables are presented in Fig 1. One can see in Fig 1 that the minimum likelihood corresponds to the maximum entropy. For the first example, the condition (60) for the minimum of L, and therefore also for ln L, is The solution is x�12.3, while the maximum entropy appears for x = 12. The second example particularly shows the goodness of the asymptotic expression, as it nearly agrees with the exact one.
The asymptotic condition for the minimum likelihood (60) has a special relation to the maximum entropy principle. When Good [27] determined the maximum entropy in the same context as here, he found the condition Y A comparison of this condition with condition (60) proves that the equality of geometric means applies for maximizing the entropy while the equality of harmonic means applies for minimizing the likelihood.
The most elementary restraint is that the number of observations is just n. Then, we can write n k ¼ n À P kÀ 1 i¼1 n i , and both Eqs (60) and (62) give the same results n k = n i , i = 1,2,� � �,k −1. From this, n i = n/k and p i = 1/k result for i = 1,2,� � �,k; i.e. the maximum entropy solution is identical with the asymptotic minimum likelihood solution.

Conclusions
Five methods contributed to a considerable improvement in the theory of contingency tables: (1) the use of the LD measure, (2) the treatment of a table as a multinomial distribution, (3) the use of algebraic software, (4) the consequent utilization of linear programming, and (5) the application of the maximization of entropy under restraints.
Using the linkage disequilibrium parameter D as a measure of association between two categorical variables, which is essentially the determinant D = p 11 p 22 −p 12 p 21 of a 2×2 table, sufficed to quantify Simpson's paradox. The difference between a two-way interaction and the averaged partial interactions for the categories of a third variable was derived. For a 2×2×2 table, the difference was D 1;2 À � D 1;2j3 ¼ D 1;3 D 2;3 =½p 3 ð1 À p 3 Þ�. It became particularly clear that the agreement of D 1,2 and � D 1;2j3 can only arise when the third variable is independent of the first or second one (because D i,j = 0 means independence between variables i and j).
In many experiments, there is one response variable together with two explanatory variables. The latter ones can be arranged to ensure D 2,3 = 0, for example, by applying the treatments to the same fraction of males and females. In this way, the difference D 1;2 À � D 1;2j3 is zero and Simpson's paradox is circumvented.
However, with unplanned experiments or with two or three response or random variables, Simpson's paradox, which is essentially D 1;2 6 ¼ � D 1;2j3 , is to be expected. It is of particular interest that, with knowledge of merely one-and two-way parameters, implications for the three-way structure are possible. This was demonstrated with the Berkeley data and could be shown with other real data reflecting Simpson's paradox.

PLOS ONE
With the log-linear model, a variety of important hypotheses can be tested. However, practically relevant hypotheses were not the focus. One such hypothesis is that the actual degree of interaction between two categorical variables is the same within all levels of another categorical variable. This study derived a model for this hypothesis and applied it to the Berkeley data. It was possible to show that the model of agreeing associations between sex and admittance (measured with Pearson's correlation coefficient) within the departments does not fit the data. However, with a refinement, it could also be shown that just one department caused the heterogeneity. Within the other departments, there was no significant association between sex and admittance.
There is a need for further investigations into an optimal systematic strategy to find a parsimonious model. However, the new approach presented here is suitable for answering important questions and surely enriches the theory of contingency tables.
Tables with zero counts provoke the question of whether these counts appeared by chance or whether they were a necessary consequence of the given two-or more-way marginal totals. The application of linear programming provided a much simpler and successful way to obtain fixed zero counts than other methods used so far. Furthermore, fixed nonzero counts can be determined; thus, the number of independent variables (degrees of freedom) could be further reduced.
Improvements were achieved for the simulation of categorical variables with given relationships, and the restrictions of formerly used procedures could be circumvented.
In this study, Good's [27] investigations on maximum entropy under restraints were repeatedly used. Maximizing entropy under restraints means determining the table or multinomial distribution that is characterized by a minimum of information, the largest disorder, or as-uniform-as-possible cell frequencies under given assumptions. It allows the numerical determination of hypothetical tables by incorporating the equations of the hypotheses. It was recalled that, with appropriate hypotheses, the results of the maximum entropy principle agree with those of the MLEs of the log-linear model. Recent doubts about the validity of hierarchical log-linear models could be eliminated.
The relation between Bartlett's multiplicative and Bennett's additive measure of the threeway interaction was investigated. As it turned out, Bennett's measure is a simplified version of the first-order Taylor expansion of Bartlett's measure. Since Bartlett's measure (which is in concordance with the maximum entropy principle and with the log-linear model) has a deeper meaning than Bennett's measure, it is concluded that Bartlett's measure is the first choice. When an easy-to-calculate measure is preferred, the full first-order Taylor expansion should be applied instead of Bennett's measure.
It was shown for contingency tables that the concept of entropy is related to the likelihood principle for the multinomial distribution. In particular, a hypothetical table with maximum entropy under linear restraints (like the given marginal totals) and a table with minimum likelihood under the same restraints are similar but not identical. The tables at the bounds of the admissible region yield local minima of entropy and local maxima of the likelihood function.
It is hoped that applicants feel encouraged to test not only the classical hypotheses but also those of particular interest and that theoreticians further improve the suggested methods.

Acknowledgments
The author thanks two referees of an earlier version of this manuscript for valuable suggestions.